Dynamic Experience Replay (DER)

1 Overview

In Dynamic Experience Replay (DER), not only human demonstrations but also successful episodes can be used as demonstrations. For robot control tasks, human demonstrations are not always perfect, so DER augments and replaces demonstrations with successful episodes. DER can work with distributed RL frameworks such as Ape-X.

DER uses multiple (global) replay buffers \(\lbrace\mathbb{B}_0\dots\mathbb{B}_n\rbrace\), each of which has a demonstration zone. At the end of an episode, an explorer randomly picks one of the replay buffers \(\mathbb{B}_i\) and stores the episode's transitions there. Additionally, if the episode succeeds, its transitions are also stored in another (ordinary) replay buffer \(\mathbb{T}\) for successful transitions.
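As a minimal single-process sketch of the explorer-side bookkeeping (the buffer count, sizes, observation shape, and the `is_success` flag are assumptions of this example, not prescribed by DER or cpprb):

```python
import numpy as np
from cpprb import ReplayBuffer, PrioritizedReplayBuffer

# Hypothetical sizes and shapes for this sketch
N_BUFFERS = 4
BUFFER_SIZE = 100_000
env_dict = {"obs": {"shape": (4,)}, "act": {"shape": (1,)},
            "rew": {}, "next_obs": {"shape": (4,)}, "done": {}}

# Multiple global replay buffers B_0 ... B_n (prioritized),
# plus one ordinary buffer T for successful transitions.
B = [PrioritizedReplayBuffer(BUFFER_SIZE, env_dict=env_dict) for _ in range(N_BUFFERS)]
T = ReplayBuffer(BUFFER_SIZE, env_dict=env_dict)


def store_episode(episode, is_success):
    """Store a finished episode as DER prescribes.

    `episode` is a list of transition dicts; `is_success` is a
    task-specific success flag (an assumption of this sketch).
    """
    # The explorer picks one replay buffer at random for this episode.
    rb = B[np.random.randint(N_BUFFERS)]
    for tr in episode:
        rb.add(**tr)
    rb.on_episode_end()

    # Successful episodes additionally go to the success buffer T.
    if is_success:
        for tr in episode:
            T.add(**tr)
        T.on_episode_end()
```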

The learner randomly picks one of the replay buffers \(\mathbb{B}_j\), samples a mini-batch, trains the network, and updates priorities as in ordinary prioritized experience replay. Periodically, the demonstration zones are replaced by transitions randomly sampled from the successful-transition replay buffer \(\mathbb{T}\).
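A sketch of one learner update, assuming the buffers `B` from the explorer sketch above; `compute_td_error` and `train_on_batch` are user-supplied callables standing in for the actual network update and are not part of cpprb:

```python
import numpy as np


def learner_step(B, compute_td_error, train_on_batch, batch_size=32, beta=0.4):
    """One DER learner update (sketch).

    `compute_td_error` and `train_on_batch` are hypothetical helpers
    representing the learner's network, not cpprb functionality.
    """
    # Pick one of the global replay buffers at random.
    rb = B[np.random.randint(len(B))]
    sample = rb.sample(batch_size, beta=beta)

    # Train with importance-sampling weights, then write back new
    # priorities, as in ordinary prioritized experience replay.
    td_error = compute_td_error(sample)
    train_on_batch(sample, weights=sample["weights"])
    rb.update_priorities(sample["indexes"], np.abs(td_error) + 1e-6)
```

The periodic refresh of the demonstration zones from \(\mathbb{T}\) is deliberately left out of this sketch; a plain cpprb buffer has no notion of a protected zone, which is exactly the issue discussed in the next section.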

2 With cpprb

DER requires a specialized demonstration zone where transitions are kept. Naively, it can be realized with two replay buffers, as in the sketch below. However, to sample from this joint replay buffer in a prioritized way, cpprb needs some modification. We might implement something like JointPrioritizedReplayBuffer.
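A minimal sketch of that naive two-buffer idea, assuming a fixed demonstration fraction per mini-batch; `NaiveDemoBuffer`, `demo_fraction`, and `refresh_demo` are hypothetical names invented for this example. Because the demonstration zone is sampled uniformly rather than by priority, this is not true joint prioritized sampling, which is what a `JointPrioritizedReplayBuffer` would have to provide.

```python
import numpy as np
from cpprb import ReplayBuffer, PrioritizedReplayBuffer


class NaiveDemoBuffer:
    """Naive stand-in for a buffer with a demonstration zone.

    Demonstrations live in a plain ReplayBuffer, explorer transitions
    in a PrioritizedReplayBuffer; mini-batches take a fixed fraction
    from each. This is NOT joint prioritized sampling over both zones.
    """

    def __init__(self, size, env_dict, demo_size, demo_fraction=0.25):
        self.demo = ReplayBuffer(demo_size, env_dict=env_dict)
        self.main = PrioritizedReplayBuffer(size, env_dict=env_dict)
        self.demo_fraction = demo_fraction

    def add(self, **transition):
        # Ordinary explorer transitions go to the prioritized part.
        self.main.add(**transition)

    def refresh_demo(self, T, demo_size):
        """Overwrite the demonstration zone with samples from T."""
        self.demo.clear()
        self.demo.add(**T.sample(demo_size))

    def sample(self, batch_size, beta=0.4):
        n_demo = int(batch_size * self.demo_fraction)
        demo_batch = self.demo.sample(n_demo)
        main_batch = self.main.sample(batch_size - n_demo, beta=beta)
        batch = {k: np.concatenate([demo_batch[k], main_batch[k]])
                 for k in demo_batch}
        # Priorities can only be updated for the prioritized part.
        return batch, main_batch["indexes"], main_batch["weights"]
```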

3 References